- We have been considering main/linear effects
- What’s the effect of x1 on y? Of x2?
- Yes, they can change due to collinearity
- But it’s still a main effect:
- For a given dataset, we modelled the effect of x as unchanging
\(y \sim b_0 + b_1*x + b_2*x^2\)
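To make this concrete, here is one way a toy dataset like the `data_quad` used below could have been simulated (the coefficient values and noise level here are made up for illustration):

```r
set.seed(1)
# 100 observations with a mild quadratic relationship between x and y
data_quad <- data.frame(x = runif(100, 0, 5))
data_quad$y <- 0.2 + 2.3 * data_quad$x - 0.11 * data_quad$x^2 +
  rnorm(100, sd = 1.3)
```

Note that in a model formula the quadratic term must be wrapped in `I()` (as in `y ~ x + I(x^2)`), because `^` has a special meaning inside R formulas.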
Can you think of any reasonable examples?
What’s the harm in ignoring it?
When should I care? THEORY (+ fit)
Try to improve fit, but only if it’s worth it
BIC = one measure of “worth it”; prefers more parsimonious models
mod_linear <- lm(y ~ x, data_quad)
summary(mod_linear)
## 
## Call:
## lm(formula = y ~ x, data = data_quad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2013 -0.4230 -0.0765  0.5587  4.3390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22172    0.18532   1.196    0.234    
## x            1.77471    0.07086  25.044   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.416 on 98 degrees of freedom
## Multiple R-squared:  0.8649, Adjusted R-squared:  0.8635 
## F-statistic: 627.2 on 1 and 98 DF,  p-value: < 2.2e-16
mod_quadratic <- lm(y ~ x + I(x^2), data_quad)
summary(mod_quadratic)
## 
## Call:
## lm(formula = y ~ x + I(x^2), data = data_quad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9126 -0.3740 -0.0976  0.3809  5.5373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12406    0.17037   0.728    0.468    
## x            2.29383    0.13086  17.529  < 2e-16 ***
## I(x^2)      -0.11373    0.02493  -4.562 1.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.292 on 97 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8864 
## F-statistic: 387.4 on 2 and 97 DF,  p-value: < 2.2e-16
##               df      BIC
## mod_linear     3 365.2070
## mod_quadratic  4 350.3721
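A table like the one above can be produced with base R's `BIC()`, assuming the two models fitted on the previous slides:

```r
BIC(mod_linear, mod_quadratic)
```

`BIC` penalises each model's log-likelihood by its number of parameters (the `df` column), so the extra `I(x^2)` term has to buy enough fit to pay for itself.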
Lower BIC = better
| Difference in BIC | Claim |
|---|---|
| <2.0 | Weak |
| 2.0–5.9 | Positive |
| 6.0–9.9 | Strong |
| ≥10.0 | Very strong |
https://sites.stat.washington.edu/raftery/Research/PDF/socmeth1995.pdf
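Applying Raftery's scale to the comparison above, using the numbers from the BIC table:

```r
365.2070 - 350.3721  # BIC difference of about 14.8
```

A difference of about 14.8 is well above 10, i.e. "very strong" evidence in favour of the quadratic model.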
Note: this is not about a correlation between gender and feed. What would that even mean?
mod1 <- lm(weight ~ gender + feed,
data=fish_data)
coef(mod1)
## (Intercept)  gendermale     feednew 
##   1992.2421    394.4623    320.4373
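The second set of coefficients (below) presumably comes from refitting the model with the interaction included, e.g. (the model name `mod2` here is made up):

```r
mod2 <- lm(weight ~ gender * feed, data = fish_data)
coef(mod2)
```

In an R formula, `gender * feed` expands to `gender + feed + gender:feed`, i.e. both main effects plus the interaction term.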
##        (Intercept)         gendermale            feednew gendermale:feednew 
##          2009.6081           396.0327           400.6258          -215.9386
## # A tibble: 4 x 3
## # Groups:   gender [2]
##   gender feed        m
##   <chr>  <chr>   <dbl>
## 1 female control 2010.
## 2 female new     2410.
## 3 male   control 2406.
## 4 male   new     2590.
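A table of cell means like the one above can be computed with dplyr (the column names follow the output shown):

```r
library(dplyr)
fish_data %>%
  group_by(gender, feed) %>%
  summarise(m = mean(weight))
```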
What if we exclude the interaction term?
Run a linear regression with just the two main effects (data: fish_data_interaction)
## 
## Call:
## lm(formula = weight ~ gender + feed, data = fish_data_interaction)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -618.07 -136.60   -1.32  135.55  713.65 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2063.593      8.027  257.10   <2e-16 ***
## gendermale   288.063      9.268   31.08   <2e-16 ***
## feednew      292.656      9.268   31.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.2 on 1997 degrees of freedom
## Multiple R-squared:  0.4957, Adjusted R-squared:  0.4952 
## F-statistic: 981.5 on 2 and 1997 DF,  p-value: < 2.2e-16
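One way to ask whether the interaction term is "worth it" here is the same BIC comparison used earlier (a sketch; the model names are made up):

```r
mod_main <- lm(weight ~ gender + feed, data = fish_data_interaction)
mod_int  <- lm(weight ~ gender * feed, data = fish_data_interaction)
BIC(mod_main, mod_int)  # lower BIC = better
```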
New example: you’re testing whether a new “cognitive enhancement” drug speeds up reaction times in a decision-making task. The control group gets a placebo; the experimental group gets the drug. There are pre- and post-treatment measures. Assuming that the drug works:
Does it look like there’s an effect of the drug? Yes!
So do you expect:
## 
## Call:
## lm(formula = rt ~ group + phase, data = drug_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1603.52  -322.26   -13.99   313.68  2061.26 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1406.18      19.16  73.410   <2e-16 ***
## groupexpt           -206.97      22.12  -9.357   <2e-16 ***
## phaseposttreatment  -200.67      22.12  -9.073   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 494.6 on 1997 degrees of freedom
## Multiple R-squared:  0.07839, Adjusted R-squared:  0.07747 
## F-statistic: 84.93 on 2 and 1997 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = rt ~ group * phase, data = drug_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1503.55  -308.91    -1.22   301.64  1961.29 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1306.212     21.667  60.287   <2e-16 ***
## groupexpt                      -7.028     30.641  -0.229    0.819    
## phaseposttreatment             -0.733     30.641  -0.024    0.981    
## groupexpt:phaseposttreatment -399.879     43.333  -9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 484.5 on 1996 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1148 
## F-statistic: 87.39 on 3 and 1996 DF,  p-value: < 2.2e-16
By default, R dummy-codes factors: the alphabetically first level becomes the reference level, coded 0, and the other level is coded 1.
So what does 0 mean here? With an interaction in the model, each lower-order coefficient is the effect when the other predictor is at its reference level (i.e. at 0), not the overall main effect.
So what can you do? Center your predictors, so that 0 corresponds to the mean rather than to one particular group.
drug_data %>% mutate(rt_scaled = scale(rt)) %>% head
##   id   group    phase        rt   rt_scaled
## 1  1 control baseline 1659.2623  0.88730013
## 2  2 control baseline  620.3499 -1.13026810
## 3  3 control baseline 2134.6126  1.81043075
## 4  4 control baseline  912.6984 -0.56252727
## 5  5 control baseline 1233.8626  0.06117378
## 6  6 control baseline 1290.1337  0.17045225
fish_data_interaction %>% mutate(gender_centered = myCenter(gender)) %>% head
##   gender    feed   weight gender_centered
## 1   male control 2358.956             0.5
## 2 female control 2216.687            -0.5
## 3   male control 2616.321             0.5
## 4 female control 1844.046            -0.5
## 5   male control 1898.898             0.5
## 6 female control 2124.016            -0.5
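`myCenter` is not a base R function; a minimal definition consistent with the output above (a balanced two-level factor mapped to -0.5/+0.5) might look like:

```r
myCenter <- function(x) {
  # convert a factor/character to a 0/1 dummy, then subtract the mean;
  # with balanced two-level data this yields -0.5 and +0.5
  if (is.factor(x) || is.character(x)) x <- as.numeric(as.factor(x)) - 1
  x - mean(x, na.rm = TRUE)
}
```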
## 
## Call:
## lm(formula = rt ~ phase_centered * group_centered, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1503.55  -308.91    -1.22   301.64  1961.29 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1202.36      10.83 110.987   <2e-16 ***
## phase_centered                 -200.67      21.67  -9.262   <2e-16 ***
## group_centered                 -206.97      21.67  -9.552   <2e-16 ***
## phase_centered:group_centered  -399.88      43.33  -9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 484.5 on 1996 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1148 
## F-statistic: 87.39 on 3 and 1996 DF,  p-value: < 2.2e-16
No, it depends on your research question
Imagine we’d called it “before” and “after” instead of “baseline” and “posttreatment”
## 
## Call:
## lm(formula = rt ~ phase * group, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1503.55  -308.91    -1.22   301.64  1961.29 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1305.479     21.667  60.253   <2e-16 ***
## phasebefore              0.733     30.641   0.024    0.981    
## groupexpt             -406.907     30.641 -13.280   <2e-16 ***
## phasebefore:groupexpt  399.879     43.333   9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 484.5 on 1996 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1148 
## F-statistic: 87.39 on 3 and 1996 DF,  p-value: < 2.2e-16
fish_data_interaction %>%
mutate(gender = factor(gender, levels=c("male", "female"))) %>%
lm(weight ~ gender,.) %>%
summary
## 
## Call:
## lm(formula = weight ~ gender, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -764.40 -178.29   -0.44  171.38  859.98 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2497.984      8.023  311.37   <2e-16 ***
## genderfemale -288.063     11.346  -25.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253.7 on 1998 degrees of freedom
## Multiple R-squared:  0.2439, Adjusted R-squared:  0.2436 
## F-statistic: 644.6 on 1 and 1998 DF,  p-value: < 2.2e-16
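An equivalent way to change the reference level is `relevel()`:

```r
fish_data_interaction %>%
  mutate(gender = relevel(factor(gender), ref = "male")) %>%
  lm(weight ~ gender, .) %>%
  summary
```

Either way, the chosen reference level is the one coded 0, and the remaining coefficient is the difference from it.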
What should my sample size be if I need to test for an interaction?
Big
VERY INFORMAL rule of thumb: if you need a sample of \(N\) to detect a main effect of size \(b\), you might need around \(4N\) to detect an interaction of size \(b\)
We’ll worry more about power tomorrow